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Abstract 

In this paper, we consider the classic measurement error regression scenario in which our independent, 
or design, variables are observed with several sources of additive noise. We will show that our motivating 
example's replicated measurements on both the design and dependent variables may be leveraged to enhance 
a sparse regression algorithm. Specifically, we estimate the variance and use it to scale our design variables. 
We demonstrate the efficacy of scaling from several points of view and validate it empirically with a biomass 
characterization data set using two of the most widely used sparse algorithms: least angle regression (LARS) 
and the Dantzig selector (DS). 

1 Introduction 

This paper is motivated by the practical problem of how to meaningfully perform sparse regression 
when the predictor variables are observed with measurement error or some source of uncertainty. We 
will refer to this error or noise as design uncertainty to emphasize that perturbations in the design 
matrix may arise from a number of random sources unrelated to experimental or measurement error 
per se. Recent work in this area has just begun to address the issue of sparse regression under design 
uncertainty from a theoretical point of view. We are primarily interested in describing an approach 
that, while theoretically justifiable, is essentially pragmatic and broadly applicable. In sh ort, we 
argue that greed - a basic feature of many sparsity promoting algorithms - is indeed good Troprj . 



12004 . so long as the design data is scaled by the uncertainty variances. We demonstrate the efficacy 
of scaling from several points of view and validate it empirically with a biomass characterization 
data set using two of the most widely used sparse algorithms: least angle regression (LARS) and 
the Dantzig selector (DS). 

Our work was motivated by an example from a biomass characterization experiment related to 
work at the National Renewable Energy Laboratory. The example is described in detail in Section|4] 
and contains repeated measurements of mass spectral (design, or predictor) and sugar mass fraction 
(response) values within each switchgrass sample. The domain scientists' goal was to find a small 
subset of masses in the spectrum that could be used to predict sugar mass fraction. We will show 
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that the replication of each measurement allows for simple estimates of the error variances which, 
in turn, may be used to guide the model selection procedure. Thus, we are interested in sparse 
regression under design uncertainty. We would also like for a scientist examining the model to 
have some hope of interpreting its meaning, either for immediate understanding or to indicate new 
research directions. 

Sparse regression by l\ minimization is a thriving and relatively young field of research. In 
the statistical inference l i teratu re, early stepwise-type algorithm s paved the way f or the now- 
familiar lasso Tibshiranl . 19961 ] . least angle regres sion (LARS) Efron et al. . 2004 1 , and many 



variants tailored to specific problems (for example, Yuan and Lin _ 2006| . Zou and Hastie 20od | 



variants tailored to specmc problems (tor example, liuan ana 
Tibshirani and Tavlorl [201l| jHastie et all |2007lj . iPercival et al.l 



2011 



A parallel evolution in the 
signal pr ocessing literature led to the developm ent of widely used basis a nd matching pur s uit al - 
gorithms [Chen et all Il998l l200ll lTroppll20o3 |. the Dantzig selector (DS) (Candes and Tad . 120071 ] . 



and many others (see, e.g., lEladl [2010J, Chapters 3 and 5, for a good overview). Despite their 



mostly independent development, the algorithms coming out of the s tatistical and signal processing 



worlds lead to remarkably similar results in many applications {e.g., iBickel et all [200 



Linear regression under the assumption of design uncertainty has, in comparison, a long history, 
going by various names such as error in v ariables or func t ional modeling, and a variety of techniques 



have been developed to address it {e.g., Gillard 2010| . Fuller 1987 , 1995]). Until fairly recently, 
however, much of the analysis of sparse representations has not confronted this issue. As we will 
discuss, there is good reason for this, namely, that this problem obfuscates the goal of sparse 
regression. 

Several recent works th at have looked at sparse regression under various assumptions about the 
noise should be mentioned. Rosenbaum and Tsvbakov 2010| |. d evelop a Dantzig-like es timator that 
they argue is more stable than the standard lasso or Dantzig. Sun and Zhangl [201lj describe an 
algorithm to estimate the lasso solution an d the noise level sim ultaneously. A similar idea, leading 
to the "adaptive lasso", was developed by Irluang et all 20081 under ho moscedastic assumptions. 
An algorithm that hybridizes total least squares Golub and Loan! 198C | a co mputational error in 



variables model, and the lasso was also recently published by iZhu et all 2011 



The work that comes nearest to our discussion is by IWagener and Dette [2011alb1 ] . In these 
papers, the authors present some asymptotic results for bridge and lasso estimators under the 
assumption of heteroscedasticity. In particular, they develop a weighting scheme that leads to 
adaptive lasso estimates that are sign consistent {i.e., they satisfy the "oracle property"). 

We consider this paper to be somewhat disjoint from the aforementioned for two reasons. First, 
we are primarily concerned with an approach that incorporates empirical knowledge of design 
uncertainty into the analysis. Second, we wish to argue from a more general, and necessarily 
more heuristic, point of vie w that does not require stringent conditions, such as those described by 
Wagener and Dette] 2011a | . Section 3, to hold. In other words, we want to allow for the possibility 
that the data that is given to us may be "messy." For example, we do not expect the design matrix 
to satisfy the restricted isometry property or to have low mutual coherence which, under certain 
circumstances, would guarantee the efficacy of an appropriate sparse algorithm. 

A central notion throughout this paper is that many of the standard sparse regression algorithms 
are greedy, that is, they search for a solution incrementally, using the best available update at any 
given point in the search. As such, we argue that estimates of uncertainty should modify the notion 
of greed. Some algorithms, such as orthogonal matching pursuit (OMP), basis pursuit (BP), and 
forward stagewise regression (FS), are explicitly greedy. Others, like those that solve the lasso and 
Dantzig selector problems, may also be viewed as greedy via their connection to homotopy methods 
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Asif and Rombergll2009ll2010llEfron et al.ll2004j . These methods generally take an initial estimate 



of the solution and move along a continuous path to the final one, choosing the best available search 
direction at each step. 

Initially, we take forward stagewi se (FS) regression a s our prototype for analysi s, noting its close 
relationship to the lasso and LARS jHastie et all 120071 ] . as well as OMP and BP (EladL l2010| . We 
show that for all solution paths of a fixed norm, the uncertainty of the residual and the solution 
norm have a dual-like relationship in which the homogeneity of one induces inhomogeneity of the 
other, and that one can move from one problem to the other via a scaling of the design variables. 
From the standpoint of sparse pursuit, we argue that, as a general principle, uniform growth of the 
uncertainty along the solution path is preferable to uniform growth of the solution norm. Similar 
arguments are shown to apply to the Dantzig selector (DS). We then compare LARS and DS 
cross-validated model selection on a repeated measures biomass characterization data set in which 
variances are estimated via an analysis of variance (ANOVA) model. In this application, scaling 
by the uncertainty variance leads to sparser and more accurate models. Prediction error is reduced 
even further if, after down-selection of the variables by LARS and DS, the solution is updated via 
an ?2 method such as ridge regression. 



2 Regression under design uncertainty 

In this section, we formulate the model of interest and outline the challenges posed by design 
uncertainty. More importantly, we derive a simple estimate of this quantity which will play a 
central role in the discussion. We also give a simple example that illustrates how a very sparse 
solution can sometimes be associated with more of the design uncertainty than a less sparse one, 
further motivating our approach. 

2.1 Model 

We consider response data of the form, 

Vi 

Wi 

e; 

Cov(«;, e) 

and design data, 



Vij+Sij, (2.2) 
0, Vj 

for 1 < i < n and 1 < j ' < p. The assumptions on w and Vj imply that the data are mean centered, 
and we interpret e and dj as independent uncertainties, 



= w l + e il 

~ N(0,a 2 e ), 
= 0, 



Xij — 
Vij ~ 

Cov(vj, dj) = 



CavMi) = 0, Vj, 
Cov(Sj,S k ) = 0, j^k, 
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arising from measurement error, natural within-sample variability, or other random sources. 
We will often express the system in matrix form, 

y = w + e, 

X = V + A, 

where e = (ei, £2, . . . , e n )' and A = [dij] n xp, and take the columns of X and y to be scaled to unit 
variance, leading to the constraints 

a 2 w +al = 1, 

<+4 = 1, Vj. 

Furthermore, we have 

Cov(X) = Cov<y) + £ 2 , 

where S 2 = Cov( A) = diag(cr 2 i , a 2 2 , . . . , tr 2 ) by the independence of the errors. When using finite 
sample estimates s 2 . of a 2 ., we denote the corresponding matrix by S 2 . In the absence of noise 
(a 2 = and a 2 . = 0, Vj), we assume that the design and response admit a linear model, 

w = V/3. (2.3) 

We are particularly interested in the case where (3 is sparse: loosely speaking, many of its elements 
are zero. 

In the application discussed in Section |4j repeated measurements are used to estimate the 
variances a 2 . . For a more theoretical application, it may be the case that these parameters are 
known exactly. Either way, for the remainder of the discussion we assume that either the variances 
or their sample estimates are available. 



2.2 The challenge of design uncertainty 

One intrinsic challenge in working with noisy design data is that the estimated regression coefficients 
are attenuated from their true values. Suppose, instead of (|2.3|) . we were to solve 



y = X/3 (2.4) 
via ordinary least squares (OLS) to obtain f3. For p = 1, it is straightforward to show that 

E{/3} _ a 



< 1, (2.5) 



where E denotes expected value. This implies that the estimators are biased towards zero by an 
amount that depends on the signal-to-noise ratio, a 2 , /a 2 . More generally, for any full rank X £ M. nxp 
with n > p, X may be diagonalized such that in the new system of coordinates an analogous result 
holds. 
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Design uncertainty also degrades the model fit even if the exact solution (3 is known. To see 
this, consider the residual error when design uncertainty is present: 

E{||y-X/3|||} = E{||(w + e)-(V + A)/3||2} 
= E{||6-A/3||i} 

= E{e'e - 2e'A/3 + /3'A'A/3} (2.6) 
= of + /3'E{AA}/3 
= of+/3'£ 2 /3 

where we have explicitly used (|2.3[) and the independence of the errors. 

This brings us to a main point, that the contribution of the design uncertainty to the residual 
is of the form which is quadratic in E. While we may only have access to the attenuated 

estimate (3 of /3, the structure of the residual error remains the same with respect to the error 
variances. We illustrate the effect this can have on sparse regression with a simple example. 

Example. Suppose p = 3 and 



1 1 1 

2'4'4/' 

The system admits the two solutions /3^' = (1,0,0)' and /3' 2 ' = ^7|(0, 1, 1)'- The first solution is 

the sparsest but in light of (|2.6[) has greater expected error since ||£/3 (1) || 2 = 1/2 while ||S/3 (2) || 2 = 
a/2/4. Hence, recovery of the sparsest solution results in greater uncertainty in the fit than the 
less sparse one. The issue becomes even more prominent - and more difficult to track - in higher 
dimensions with non-trivial covariance of the design matrix. 

Apparently, greed is not always good under design uncertainty. 

3 Scaling penalizes design uncertainty in the solution path 

In this section, we briefly describe a prototypical greedy algorithm for sparse regression, forward 
stagewise regression (FS). We do so because it is helpful to have a particular algorithm in mind for 
the discussion, and this one is particularly easy to understand. In addition, it solves the widely-used 
lasso optimization problem and thus is closely related to a variety of other important algorithms 
Efron et ail |2004 lHastie et all 12007 1. Next, we state the main result and provide simple algebraic 



w 


= «1, 






v 1 


■ * 


°l 


= 0, 


S 2 


= diag 



and geometric interpretations of it. Finally, we note implications of the result for the Dantzig 
selector problem. 



3.1 A prototypical pursuit algorithm: forward stagewise (FS) regression 

The FS algorithm may be summarized as follows: 
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1. Fix small 7 > and initialize: = 0, r = y. 

2. Identify the design variable x.,- most correlated with r. 

3. Incremental updated : /3j <— /3j + rjj, where r)j = 7 ■ sign(Corr(xj ■, r)). 

4. Subtract the projection of r onto x 3 : r r — rjjXj. 

5. If the residual norm is small enough, stop. Otherwise, return to step 2. 

Qualitatively, the algorithm finds the best search direction - the coordinate with highest residual 
correlation - and takes a small step in that direction. It does so iteratively, updating the solution 
and residual at each step, until the minimal residual error is reached. 

As an optimization procedure, FS regression (like the lasso and LARS) implicitly solves 

argmin -||y- Xf3\\ 2 subject to ||/3||i < A, (3.1) 

which is often expressed in Lagrangian form, 

^gunn^Wy-XpWl+XMlx, (3.2) 

for a range of values of the tuning parameter A > 0. In the limit A —> 0, the optimum is attained by 
the ord inary least square s solution, while solutions for A — > 00 are increasingly sparse (A = 00 



(3 = 0) |TibshiraniL[l996j . 



3.2 Main result 

Our main result is simple: it says that for all solutions of a fixed norm, the accumulated design 
uncertainty (estimated by ||E/3||2 in equation (I2.6[l ) is path-dependent unless the data are scaled 
by the uncertainty variance. In other words, scaling the data leads to a uniform increase of the 
design uncertainty contribution, independent of the search direction. 

To see this (and with a slight abuse of notation) , we first modify equation (|2.4I) to include scaling 
of the design variables, 

y = XD- 1 (3, (3.3) 
noting that if (3 solves (|2.3p . then Df3 solves p.3|) . The expected residual variance is then 

E{||y- XD^PWl} = a 2 e +E{(3'D- 1 A'AD- 1 f3} 

= ^ + ||ED- 1 /3||2. (3.4) 

Now let U(f3;D) = ||SD _1 /3||| denote the design uncertainty denoted with a solution (3 of 
fixed norm, ||/3||| = T 2 . Clearly, the uncertainty is independent of the uncertainty variances when 
D = IT 1 (i.e., D j:j = cr^ 1 ). Specifically, U(f3;D) = T 2 for any E. 

Scaling the data will result in solutions of different norms, so that two solutions of norm T under 
different scalings D^ 1 and D^ 1 are not directly comparable in terms of the underlying optimization 
problem. However, the result says that scaling by D = E leads to a solution space in which all 
solutions of identical norm have identical uncertainty. 



1 In LARS, the step is computed in a particularly efficient way but the final solution path is essentially the same. 
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3.2.1 Algebraic interpretation of scaling 

Based on our claim, we consider scaling the columns of X by the associated uncertainties, 

D = S, 

X <- XD- 1 . 

The most obvious effect of scaling is that the correlations change and so (potentially) does the order 
in which the variables are selected (step 2 of the FS algorithm). Recalling that the columns of X 
and y have unit variance, we initially have 

Corr(xy,y) = x^y, 

while after scaling, 

Corr(xj,y) = x^-y/cr 5j . 

A less obvious effect of scaling is that the underlying problem (13.21) is transformed so that 
uncertainty in the solution path is penalized explicitly. The scaled problem, 

arg min hy - XD^PWl + A||/3||i, (3.5) 

by a simple change of variables, f3 <— Df3, may be written 

arg mm i|ly - X/3||| + Aj|i>/3||x. 



We note that this is the "generalized lasso" problem described in iTibshirani and Tavlorl [201l| . 



The lasso penalty term represents the u l\ version" of the design uncertainty (recall that ||S/3||2 < 
||S/3||i < y/p\ | E/3 1 1 2, by norm equivalence). Hence, scaling by D = £ leads to a direct l\ penalization 
of design uncertainty within the lasso framework. 

3.2.2 Geometric interpretation of scaling 

Geometrically, scaling by the uncertainty induces a dual-like problem in which the homogeneity of 
solution norm and the uncertainty are reversed (Figure [3. In particular, before scaling, a step of 
fixed size 7 leads to constant growth of the l\ penalty but potentially non-uniform growth of the 
uncertainty. After scaling, on the other hand, the uncertainty grows uniformly at each step while 
the l\ penalty does not. 

Of course, for a given data set, the greedy algorithms we have discussed are not random but 
deterministic. But if we consider the task of sparse regression as applying to an ensemble of noisy 
data sets, one can think of the solution pat hs as being effectively random (for a similar line of 
reasoning see, e.g., iDonoho and Tsaigl [2008j ). That is, a statistical analysis of the algorithm is 



then necessarily and justifiably carried out in terms of expectations, rather than specific search 
paths. 

3.3 Connection to the Dantzig selector 

While a detailed analysis is beyond our scope, we take a brief m o ment to point out the connec- 
tion between scaling and the Dantzig selector. ICandes and Tao [2007] proposed an alternative 
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Without scaling 




Fig. 3.1: The dual-like relationship between uncertainty and solution norm. Left: Without scaling, 
uncertainty grows anisotropically (indicated by an ellipse around the origin) for fixed step 
size 7 of the greedy algorithm, while the solution norm grows uniformly (indicated by 
perpendicular contour lines on each axis). Right: After scaling by -D -1 , the uniformity of 
uncertainty and solution norm are reversed. The notation ||d(-)||i represents the increase 
in the l\ penalty term for a small step in the solution path. 
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formulation for sparse regression, 

argrmn||/3||i subject to ||X'(y - Xf3)\\oo < A<r e , 

where A > is a tuning parameter (different from the lasso parameter) and is the variance in 
(|2.ip . The Dantzig selector has two main features that distinguish it from other pursuit algorithms. 
The first is that the problem may be written explicitly as a linear program (LP), for instance, 

argmin Vol subject to ~a < (3 < a 

ex., (3 

and -ctAI < X'(y - Xf3) < a\l. 

The second is that the l<x> constraint is with respect to residual correlations as opposed to residual 
error. This seems intuitively correct since, in the presence of noise, we would expect the residual 
corresponding to an optimal solution to have exactly this property. 

Now consider the change of variables X 4— XD^ 1 , (3 4— D(3, and a <— Da as in Section f3. 2. II 
In the Dantzig context, this leads to the linear program: 

argmin l' (Da) subject to — a < f3 < a 

ct,f3 

and -o-XDl < X'(y - X/3) < crXDl. 

Notice that the feasible region is stretched along the noisier dimensions (proportionally to as j ) , 
resulting in relaxed requirements for the residual correlation in those coordinates. This is reasonable, 
as we would expect the accuracy for a given variable to be inversely related to its uncertainty. As 
in the lasso context, scaling also results in an explicit li penalization of the variables commensurate 
with their noise level via minimization of the quantity 1 (Da). 

Example. Continuing the example from Section |2~2| recall that 

J? u = 1/2, 

^22 — = 1/4, 

and that the uncertainty in the sparsest solution was greater than the next sparsest. Figure [3721 
gives a concrete illustration of the solution path for the scaled and unsealed data as well as the 
uncertainty in the fit (left panel). After three FS steps (identically for lasso and LARS), there is 
zero residual error for both the scaled and unsealed design (red lines, right panel). However, the 
uncertainty associated with f3 2 is less than that of (3 1 by a factor of 2 (black lines, right panel). 

4 Application to biomass characterization data 

In this section, we present results for both LARS and DS applied to a biomass characterization data 
set, with and without scaling. We highlight the challenges in working with this data, and illustrate 
the efficacy of scaling. 
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i 1 1 1 r 

1.0 1.5 2.0 2.5 3.0 

Step 




1.0 1.5 2.0 2.5 3.0 

Step 



Fig. 3.2: A toy example in which the sparsest solution has more uncertainty than the next sparsest 
one (see Section [2"T2j) . Left: the regression coefficients at each stage of the FS algorithm. 
Only non-zero coefficients are plotted. Right: the uncertainty as estimated by ||5/3||2 
(black), with residual sum of squares (RSS) on the right axis (red). The results are 
identical for lasso and LARS. 
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-1.0 -0.5 0.0 0.5 1.0 



Log 10 Total variance (s^ + Sj) 

Fig. 4.1: Graphical summary of the biomass characterization data. The top plot shows a typical raw 
MBMS spectrum. The inset shows, after pre-processing, the maximum cross-correlations 
for each peak and indicates a high degree of mutual dependence between the predictors. 
The bottom plot shows the distribution of variances estimated via a random one-way 
ANOVA model before normalization. The marker radius is proportional to mass-to-charge 
ratio of the peak, while the response ratio is indicated by a red triangle. A 1 : 1 ratio is 
indicated by a black line. Points lying further down and to the right of the solid black line 
have higher fidelity (i.e., higher signal-to- noise ratio). 
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4.1 Description of the data 

The characterization experiment we consider is motivated by a desire to quickly and inexpensively 
screen potential biofuel candidates for recalcitrance - a plant's natural resistance to releasing usable 
sugars - after a chemical or enzymatic pre-treatment. Here, n = 759 switchgrass plants were grown 
at different outdoor locations and in uncontrolled conditions. The predictors co nsist of p = 421 
pyrolysis molecular beam mass spectral (pyMBMS or MBMS, Svkes et al. 2009| ) lines measured 



twice for each sample. As each sample is pyrolyzed, the spectrometer counts the number of molecules 
that reach a detector over a range of mass-to-charge ratios. The raw spectrum for each sample is 
then normalized to have unit mass and each peak is divided by a standard (control) value measured 
during the same run, allowing samples from different experiments to be compared directly. So, after 
pre-processing, each peak may be thought of as being an expression level for that mass-to-charge 
ratio relative to a control. 

The response is the mass fraction of extractable glucose as inferred by the absorbance of 510-nm 



visible light, where each sample is measured in triplicate Selig et all 1201 lj . In this experiment, 
a previously validated linear model is calibrated via measurement of a pure glucose sample. The 
mass and absorbance of each biological sample from the same run are then input to the calibrated 
model, yielding an estimate of glucose mass fraction for that sample. 

The question we ask is: can the MBMS spectrum (a proxy for chemical composition) be used to 
predict the mass fraction of extractable glucose (usable biofuel)? To answer this, we seek a sparse 
linear model that incorporates estimates of uncertainty. Brief justifications for this approach are: 

• Sparse: The spectroscopy experiment results in high cross-correlations between the peaks 
because large masses break into smaller ones in a somewhat predictable way. Hence, we 
expect a significant amount of redundancy in the peaks. In addition, the relationship between 
mass spectral peaks and cell chemistry is complex, making a sparse model appealing in that 
it narrows the focus of future investigations to a few, rather than hundreds, of peaks and their 
associated compounds. 



Incorporates uncertainty: Some of the peaks are far noisier than others, leading to unequal 
uncertainties. We would like to ensure that the model depends on the noisy peaks as little as 
possible, without completely excluding them from consideration. 

Linear: The assumed physical model is one of linear mixture, i.e., doubling the concentration 
of an analyte in the sample should result in a doubling of its spectral signature. 



The data are summarized graphically in Figure 14.11 A typical raw mass spectrum is shown in 
the left panel where line height indicates count following convention for this field. The inset plot 
shows the maximum absolute cross-correlation of each peak with every other peak, from which we 
infer that there is a high degree of linear dependence among the variables, especially the smaller 
masses. In the right panel, the estimated total and within-sample ANOVA variances are shown 
before normalization or scaling, with equality indicated by a black line. The mass-to-charge ratios 
of the MBMS lines are proportional to the marker radius while glucose is indicated by a triangle. 
Clearly, many of the peaks are quite noisy, with almost all of the variance attributed to noise. 

4.2 Methods 



Model selection was performed using nested fc-fold cross validation (CV), in which standard fc-fold 
CV errors were averaged over 100/fc outer loops for k = 2,5, and 10. This approach ensured 
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that 100 different prediction models were validated for each choice of k 
Dantzig models for com parison. LARS models were fit 



We fit both LARS and 
Ft R F 201d| using the l ars package 



lHastie and Efronl 2011 1 and Dantzig in MATLAB using the Li Homo topv Toolbo x of Matl 
lAsif and Romberg! [20121 l2010j . 



2011 



ge 
1J. 



Eladl [2010| Chapter 8.5), it 



As has been suggested before (e.c 
can sometimes be beneficial to regress y on the sparse predictor set using another fitting procedure. 
For comparison, we used the LARS- an d Dantzig-selected peaks as input to cross validated ridge 



regression via the par cor package in R Kraemer and Schaefer . 2010l |. In all instances, the scaling 



matrix was estimated as part of the cross validation procedure (see Appendix for details). 



4.3 Results 

Cross-validation results are given in Table [5] in the Appendix, and may be summarized as follows. 
Scaling leads to: 

1. improved accuracy, as measured by cross- validated MSEP, for both LARS and DS (Figure 

2. increased sparsity for both LARS and DS (Figure PO]) 

3. higher degree of consistency between LARS and DS 

The left panel of Figure P3~2l shows the prediction error per LARS step (solid lines), the standard 
error (shaded regions), and the uncertainty as estimated by ||S/3||2 for k = 10 (dashed lines). The 
optimal models are indicated by x's. While the standard error of prediction is similar for the scaled 
and unsealed case, the uncertainty accumulates more slowly for the scaled input (almost identical 
results hold for k — 2, 5, not shown). The right panel provides a graphical impression of the quality 
of the variables selected for the scaled and unsealed data. One can see that, in general, the scaled 
approach (red diamonds) leads to selection of peaks with higher signal-to-noise ratio, indicated by 
green arrows, than the unsealed (blue circles). 

Figure 14.31 shows the sparsity of the LARS and DS models as a function of the CV fold sizes. 
Remarkably, the number of non-zero coefficients for both LARS and DS actually increases with 
increasing sample size when the data are unsealed. This is somewhat surprising since, heuristically, 
one would expect the model selection to be more discriminating as more samples are utilized. On 
the other hand, the number of non-zeros decreases with increasing sample size for the scaled data. 
To explain this, we speculate that when the data are unsealed, it is more likely for the algorithm 
to select variables that are either neutral or even detrimental with respect to prediction. If this is 
the case, then our results suggest that scaling leads to a more discriminating variable selection and 
higher prediction accuracy. 

Finally, while the LARS and DS solutions are not in perfect agreement in either case (scaled or 
unsealed), they are seen to be in better agreement after scaling. Only 46% of the LARS peaks are 
also selected by DS without scaling, while the number is 86% with scaling. Alternatively, of the 
total number of distinct peaks selected by LARS and DS combined, only 24% are common to both 
without scaling, while 45% are common to both with scaling. 



4.4 Discussion 

It should be stated up front that the assumption of linearity made in Section 14.11 does not appear 
to be completely valid. While the assumption should be valid on physical grounds, there are 
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Fig. 4.2: LARS cross validation results for k — 10. The top plot shows the mean square error of 
prediction (MSEP) in bold lines and standard error as shaded regions. The dashed lines 
are estimates of uncertainty via 1 1 <S"/3| 1 2 , with units on the right axis. The optimal model 
is indicated with an x. Apparently, scaling by uncertainty variances leads to a sparser and 
more accurate model, with less associated uncertainty. The bottom plot is identical to the 
one in Figure 14. 11 but with solution coordinates selected by LARS given different markers 
based on scaling (blue circles: unsealed, red diamonds: scaled). At least 4 peaks with high 
signal-to-noise are clearly selected after scaling that are not otherwise (green arrows, with 
numbers indicating the m/z ratio). 
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Fig. 4.3: Number of non-zero coefficients for LARS and DS for k = 2,5, f 0. In all cases, the 
prediction error decreases as k (i.e., the number of training samples) increases. Without 
scaling, the more accurate models use more variables while with scaling, remarkably, they 
use fewer. 
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obviously experimental, biological or other factors that introduce significant error terms beyond 
those formulated in Section [2TT1 That said, the fact that scaling leads to a reduction in CV error, 
increased sparsity, and better agreement between LARS and DS suggests that the method can still 
be practically useful under non-ideal circumstances. 

While some of the peaks identified by both scaled LARS and DS have been previously recog- 
nized as related to recalcitrance, many have not (see Table [3] in Appendix). Of particular interest 
are the peaks with large m/z ratios, as these are less likely to be correlated coincidentally with 
recalcitrance: light particles can originate from a variety of sources, but less so for larger particles. 
Furthermore, some of the unknown peaks have regression coefficients that are not small. We be- 
lieve that these results warrant taking a further look at the unknown peak associations to better 
understand chemical mechanisms of recalcitrance. 

5 Conclusions 

We have argued that sparse regression under design uncertainty presents several challenges that 
(to the best of our knowledge) have not been addressed in the literature. Focusing on the the 
uncertainty term, | |S/3| ||, in the residual error (|2.6I) . we propose a scaling of the design variables by 
their uncertainty variances. In the context of greedy algorithms, doing this guarantees a uniform 
growth of uncertainty regardless of the order in which the variables are selected. Within the lasso 
formulation, scaling is shown to enforce an li penalization of the uncertainty. In the Dantzig 
selector context, scaling leads to modified bounds on the residual error that reflect the amount of 
uncertainty associated with each variable. 

In a biomass characterization application, scaling is shown empirically to reduce uncertainty in 
the optimal solution. It also leads to sparser solutions and lower prediction error. The solution 
estimates are improved even further if the LARS- and Dantzig-selected peaks are used independently 
for ridge regression. In addition, these models are more consistent with one another after scaling, 
that is, they identify more of the same predictors. The improvements resulting from scaling are 
promising and deserve further consideration. 
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Appendix 
AN OVA model. 

We use a one way, random effects ANOVA modef to estimate the uncertainty variances. Let r 
denote the number of replicate measurements of a random variable Z and let n denote the number 
of samples. The relevant quantities needed to estimate the variance components are shown in Table 
Q] In particular, the estimates are sj = SSE/df, and s 2 v — (SSTr/df-cr^)/r. 



Source 


df 


Sum of squares 


Expected mean square 


Treatment 


n- 1 


SSTr = Y,i-xr{zi.-z..y 


ra'i + os 


Error (uncertainty) 


n(r — 1) 


sse = £^EU(^-^ 


4 



Tab. 1: Standard one-way, random effects ANOVA table. 



Cross validation procedure. 

For clarity, we outline our procedure for cross-validated model selection using replicated measure- 
ments. It is a completely standard cross-validation procedure with the simple addition that we 
estimate the uncertainty variances only from the training data. 
For each of the k cross-validation groups: 

1. Split the data into training, {ytraim X train }, and test sets, {ytest-, X tes t}, of appropriate sizes. 

2. Using only X tra in, estimate the error variances, s|., via a suitable method (we used one-way, 
random effects ANOVA). 

3. Form the diagonal matrix Djj = 8g. and scale the training data, X tra in ^— Xt ra i n D . 

4. Fit the desired models to y train using scaled X tra in- 

5. Using D from step 3, scale the test data, X tes t XtestD" 1 , and predict. 



Model selection method 


Scaling 


k 


# predictors 


MSEP 


Avg # predictors 


Avg MSEP 






2 


47 


0.564 






LARS 


NO 


5 


57 


0.551 


54.3 


0.555 






10 


59 


0.550 










2 


36 


0.539 






LARS 


YES 


5 


31 


0.531 


31.7 


0.533 






10 


28 


0.530 










2 


47 


0.508 






LARS-RR 


NO 


5 


57 


0.485 


54.3 


0.493 






10 


59 


0.485 










2 


36 


0.492 






LARS-RR 


YES 


5 


31 


0.491 


31.7 


0.492 






10 


28 


0.494 










2 


40 


0.622 






Dantzig selector 


NO 


5 


80 


0.583 


66.7 


0.599 






10 


80 


0.574 










2 


76 


0.547 






Dantzig selector 


YES 


5 


56 


0.528 


60.3 


0.536 






10 


49 


0.523 










2 


40 


0.512 






Dantzig-RR 


NO 


5 


80 


0.487 


66.7 


0.494 






10 


80 


0.482 










2 


76 


0.460 






Dantzig-RR 


YES 


5 


56 


0.454 


60.3 


0.455 






10 


49 


0.451 










2 


421 


0.533 






Ridge regression 


NO 


5 


421 


0.536 


421 


0.535 






10 


421 


0.535 










2 


421 


0.515 






Ridge regression 


YES 


5 


421 


0.517 


421 


0.516 






10 


421 


0.517 







Tab. 2: Results of fc-fold cross validation (see also Figure l4~3|) . The -RR suffix indicates ridge regression was performed on 
the subset selected by the corresponding sparse algorithm. For comparison, results for ridge regression using all of the 
predictor variables are also shown. 
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Vtlj z 


Assignment moyKes ct ai. zUUo| 


Avg. coefficient (relative to max) 


45 


? 


+1.0000 


60 


C5 sugars 


-0.0534 


120 


Vinylphenol 


-0.2519 


126 


? 


-0. 1432 


128 


? 


-0.0482 


129 


? 


-0.2177 


137 


Ethylguaiacol, homovanillin, coniferyl alcohol 


-0.7648 


143 


? 


+0. 1795 


144 


? 


-0.2338 


150 


Vinylguaiacol 


+0.8777 


159 


? 


+0.1016 


160 


? 


+0.1988 


164 


Allyl ± propenyl guaiacol 


-0.3966 


168 


4-Methyl-2, 6-dimethoxyphenol 


-0.3949 


175 


? 


+0.0554 


182 


Syringaldehyde 


-0.0734 


194 


4-Propenylsyringol 


-0.8661 


208 


Sinapyl aldehyde 


-0.0038 


210 


Sinapyl alcohol 


+0.4334 


226 


? 


+0.3021 


264 


? 


+0.1883 


287 


? 


-0.1646 


371 


? 


-0.1970 


374 


? 


+0.2153 



Tab. 3: P eaks identified by b oth scaled LARS and DS for k = 10. The peaks previously identified 
in[P 



. Svkes et al.l [2008] as significant to sugar release are described where possible. The LARS 
and DS regression coefficients were averaged and divided by the maximum to highlight the 
relative significance and sign of correlation of the peaks. Some of the most highly-weighted 
variables have not previously been identified as being related to recalcitrance. 
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